Evolution of the Influenza A Virus: Some New Advances

Influenza is an RNA virus that causes mild to severe respiratory symptoms in humans and other hosts. Every year approximately half a million people around the world die from seasonal Influenza. But this number is substantially larger in the case of pandemics, with the most dramatic instance being the 1918 “Spanish flu” that killed more than 50 million people worldwide. In the last few years, thousands of Influenza genomic sequences have become publicly available, including the 1918 pandemic strain and many isolates from non-human hosts. Using these data and developing adequate bioinformatic and statistical tools, some of the major questions surrounding Influenza evolution are becoming tractable. Are the mutations and reassortments random? What are the patterns behind the virus’s evolution? What are the necessary and sufficient conditions for a virus adapted to one host to infect a different host? Why is Influenza seasonal? In this review, we summarize some of the recent progress in understanding the evolution of the virus.

February in the Northern Hemisphere and July-August in the Southern Hemisphere). This seasonal behavior contrasts with the constant background activity in tropical regions (Alonso, Viboud, Simonsen, Hirano, Daufenbach and Miller, 2007; Nelson and Holmes, 2007; Viboud, Alonso and Simonsen, 2006; Wong, Yang, Chan, Leung, Chan, Guan, Lam, Hedley and Peiris, 2006). Despite the different seasonal behavior in tropical and non-tropical regions, annual infection rates and symptoms are similar. Why is epidemic Influenza seasonal? Is there an accurate method to predict next season's primary strains?
The third feature of Influenza evolution is the reassortment of the viral chromosomes. The genome of the Influenza virus contains eight single-stranded negative-sense RNA segments coding for ten or eleven proteins. When two or more different Influenza viruses co-infect the same host cell, new virions are produced that can contain a combination of RNA segments from all the parental strains (see Fig. 2). This mode of evolution is related to antigenic shift and it has caused at least two of the pandemics in the twentieth century. For instance, in 1957, the Asian flu was a reassorted virus containing three segments from an avian strain (PB1, HA and NA) and the other five from the virus that was already circulating in the human population (H1N1).
In this review we will discuss some recent advances related to the three features of Influenza evolution: host variety, high mutation rates, and reassortments. In the last few years a large international effort has made thousands of viral sequences publicly available. These sequences have been isolated all around the world during the past hundred years. Using the information in these databases we will present some of the patterns of evolution of this virus. The evolution of Influenza is not completely random, i.e. there are some structures or patterns that reflect the biology of the virus, its interaction with different hosts and their immune systems.

Influenza and Its Different Hosts
Influenza is a negative-sense (antisense) RNA virus with a genome fragmented into eight different segments. These segments contain 10 or 11 open reading frames. The three longest segments contain the genes coding for the polymerase complex (PB2, PB1 and PA). Two other segments code for the proteins in the envelope of the virus, Hemagglutinin (HA) and Neuraminidase (NA). These two proteins play a crucial role in the interaction of the virus with the host cell and the host immune system. Two genes in the same segment code for the two proteins that form the capsid (M1 and M2). The other three or four proteins are the nucleoprotein (NP) and proteins (NS1, NS2 and PB1-F2) that are not incorporated in the viral particle but are important in the interaction with the host cell. PB1-F2 is a proapoptotic protein that is not present in all Influenza A viruses (Chen, Calvo, Malide, Gibbs, Schubert, Bacik, Basta, O'Neill, Schickli, Palese et al. 2001; Zell, Krumbholz, Eitner, Krieg, Halbhuber and Wutzler, 2007). It is encoded in an alternative reading frame in the same segment that encodes PB1. Classic swine and human H1N1 viruses have chain-termination (stop codon) mutations in the middle of the gene. The nomenclature for the virus comes from the serotype classification based on the antibody response to the proteins on the surface of these viruses. There are 16 types of Hemagglutinin and 9 types of Neuraminidase. All of them can be found in birds but only H1, H2, H3, N1 and N2 have been found in human epidemic Influenza. The prevalence of infection in certain avian populations, the wide variety of viral subtypes found, and a weaker immune response lead us to believe that the main reservoir of Influenza A is aquatic birds. Although in these birds transmission is through a fecal-oral route, in humans the virus spreads through droplets coming from the upper respiratory tract of an infected person. Occasionally an Influenza virus from one host jumps into a different host population.
That happened in 1957 and in 1968 when several segments (PB1, HA and NA in 1957, and PB1 and HA in 1968) from avian viruses reassorted with the pre-existing human viruses. More controversial is the possibility of the direct jump of a complete Influenza virus to a new host. Taubenberger et al. have argued that the H1N1 flu causing the 1918 pandemic (Spanish flu) was an avian flu that entered the human population (Taubenberger, Reid, Lourens, Wang, Jin and Fanning, 2005). This possibility has been questioned by several authors (Antonovics, Hood and Baker, 2006; Gibbs and Gibbs, 2006; Tumpey, Basler, Aguilar, Zeng, Solorzano, Swayne, Cox, Katz, Taubenberger, Palese et al. 2005).
It is not clear what the necessary and sufficient conditions are that define the host specificity of a virus. One of these conditions is the receptor on the surface of the host cell, sialic acid, which presents two linkages, alpha2-6 and alpha2-3, the former more common in humans and the latter in birds. Several positions related to sialic acid specificity have been mapped on the Hemagglutinin (Stevens, Blixt, Tumpey, Taubenberger, Paulson and Wilson, 2006).

The Drift of Influenza
Recent bioinformatics research has illuminated some genomic features that distinguish avian and human Influenza A viruses. Viruses whose primary hosts are avian or human have different nucleotide compositions (Rabadan, Levine and Robins, 2006). This difference in nucleotide composition is sufficient to separate the thousands of sequenced human and avian viruses with almost 100% accuracy (see Fig. 3). The four sets of strains that fail to be classified by this method are H5N1 Hong Kong, H9N2 Hong Kong, the recent H5N1 bird flu, and the 1918 H1N1 virus. These are all known to have been avian viruses that had recently entered the human population and were not able to transmit from human to human, with the sole exception of the 1918 H1N1 virus. If we can understand why and how these viruses crossed the avian-to-human line, then we will have a new tool to identify possible threats of new pandemics. Segment-by-segment analysis of nucleotide composition allows us to readily detect the reassortment of an avian segment into a human strain, such as the PB1 gene on segment 2 in the 1957 and 1968 pandemics.

Figure 2. Reassortment: when two different viruses co-infect the same host cell they produce a new virus with a combination of segments from both parental strains. When a virus from one host reassorts with a virus from another host they can create a potential pandemic virus. In this example, a human virus reassorted with an avian virus, taking three of its segments. In particular, the segment coding for HA (indicated by a black arrow) in the resulting virus is of avian origin. A similar process happened in 1968 (the Hong Kong flu).
The human viruses have a higher percentage of Uracil and Adenine, whereas the avian viruses have a higher percentage of Guanine and Cytosine in their genomes (see Fig. 4) (Rabadan, Levine and Robins, 2006). One or more segments in each human strain were acquired by reassortment from a non-human virus, possibly avian. The nucleotide composition of the reassorted segments then changes, probably due to a biased substitution rate (C→U and G→A) in human hosts relative to avian hosts. Because sequenced strains spanning the last 90 years are available, we can actually observe the steady increase of U and A along with the decrease in C and G over time as the viral subtype evolves in its human host (Rabadan, Levine and Robins, 2006). A nice example is the H1N1 subtype that entered the human population just prior to 1918. The original 1918 strain, recently sequenced from lung tissue found in several victims of the Spanish flu in Alaska, has an avian nucleotide composition in the set of statistically resolvable segments, which include PB2, PB1, PA, and NP. As we follow the sequenced H1N1 strains over the next 90 years, the composition shifts until it reaches the present-day composition, which is entirely human. The final steady-state nucleotide composition for this strain can be determined by computing the rate of substitution from the early strains. The nucleotide composition of the present-day strains is within the upper bound provided by this calculation (Rabadan, Levine and Robins, 2006).
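The idea of a steady-state composition under biased substitution can be illustrated with a small Markov-chain sketch. The substitution probabilities below are hypothetical values chosen only to mimic an excess of C→U and G→A changes; they are not the rates estimated by Rabadan, Levine and Robins (2006), and the function names are ours.

```python
NUCS = "ACGU"

def make_rates(base=1e-3, c_to_u=4e-3, g_to_a=4e-3):
    """Per-step substitution probabilities (hypothetical values).

    Every change has probability `base`, except C->U and G->A,
    which are elevated to mimic the bias described in the text.
    """
    rates = {(x, y): base for x in NUCS for y in NUCS if x != y}
    rates[("C", "U")] = c_to_u
    rates[("G", "A")] = g_to_a
    return rates

def step(comp, rates):
    """Advance the nucleotide composition by one time step."""
    new = {}
    for y in NUCS:
        inflow = sum(comp[x] * rates[(x, y)] for x in NUCS if x != y)
        outflow = comp[y] * sum(rates[(y, z)] for z in NUCS if z != y)
        new[y] = comp[y] + inflow - outflow
    return new

def steady_state(comp, rates, tol=1e-12, max_steps=1_000_000):
    """Iterate until the composition stops changing between steps."""
    for _ in range(max_steps):
        nxt = step(comp, rates)
        if max(abs(nxt[n] - comp[n]) for n in NUCS) < tol:
            return nxt
        comp = nxt
    return comp

# Start from a uniform composition and let the biased process run.
start = {n: 0.25 for n in NUCS}
final = steady_state(start, make_rates())
print({n: round(final[n], 3) for n in NUCS})
```

With these illustrative rates the balance equations give a steady state of A = U = 5/14 and C = G = 1/7, so the composition drifts toward A and U regardless of the starting point. In a real analysis the rates would be estimated from dated sequences rather than assumed.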
The observed bias in the rates of fixation of nucleotides C and U in human versus avian Influenza A viruses has three different potential explanations. One possibility is natural selection: the cellular environment in humans may favor more U's and fewer C's relative to the avian environment for the Influenza virus. This could be due to temperature differences, which affect RNA structure. However, the evidence presented suggests that the changes are not due to positive selection, because most of the mutations are found in third codon positions, consistent with neutral changes. Another possibility is that the RNA-dependent RNA polymerase machinery includes different cellular components in human and avian cells, creating a relative mutation bias. The final possibility, which we find the most intriguing, is that humans have a native defense against RNA viruses that operates in a manner similar to the Apobec family of genes (Cullen, 2006; Sawyer, Emerman and Malik, 2004; Yu, Konig, Pillai, Chiles, Kearney, Palmer, Richman, Coffin and Landau, 2004). The Apobec3G gene is known to cause deamination of Cytosine, which results in a Uracil, during the reverse transcription of Lentiviruses. The Apobec gene family does not appear to have orthologs in avian species. There have been some additional recent advances in the study of Influenza mutation presented in a series of works by Wu and Yan that are not discussed here (Wu and Yan, 2006a; Wu and Yan, 2006b). Also, the role of secondary structure has been studied in recent works, but is beyond the scope of this review (Wei, Du, Sun and Chou, 2006).

Figure 3. Log-odds score of the nucleotide composition of human and avian Influenza A viruses, computed from the coding sequences of the polymerase genes, versus year. Blue asterisks are human H1N1 strains, purple squares are H5N1 found in humans, and red pluses are the remaining human strains available from the NCBI database. Green crosses are all the avian strains available in the NCBI database at the time of analysis (Rabadan, Levine and Robins, 2006).
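The log-odds score plotted in Figure 3 compares how well a sequence fits a human versus an avian nucleotide composition. A minimal sketch follows, assuming a simple independent-site model; the two reference compositions are invented for illustration and are not the values fitted by Rabadan, Levine and Robins (2006).

```python
import math
from collections import Counter

# Hypothetical reference compositions (illustrative only): the "human"
# profile is richer in A and U, the "avian" profile richer in G and C.
HUMAN = {"A": 0.34, "C": 0.18, "G": 0.22, "U": 0.26}
AVIAN = {"A": 0.32, "C": 0.20, "G": 0.25, "U": 0.23}

def log_odds(seq, human=HUMAN, avian=AVIAN):
    """Sum over sites of log(P_human(base) / P_avian(base)).

    Positive scores favor the human-like composition, negative scores
    the avian-like one. Sites are treated as independent, and symbols
    other than A, C, G, U are ignored.
    """
    counts = Counter(seq.upper().replace("T", "U"))
    return sum(counts[b] * math.log(human[b] / avian[b]) for b in "ACGU")

def classify(seq):
    """Label a sequence by the sign of its log-odds score."""
    return "human-like" if log_odds(seq) > 0 else "avian-like"

print(classify("AUUAUAAUUCAGAUUA"))  # AU-rich toy sequence
print(classify("GCGGCCAGCGGCUGCC"))  # GC-rich toy sequence
```

Because the score is a sum over sites, longer coding sequences separate the two classes more sharply, which is consistent with scoring whole polymerase genes rather than short fragments.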

Antigenic Shift and Reassortments
When two different Influenza A viruses co-infect the same host cell, new virions are released that contain segments from both parental strains (see Fig. 2). This is the main way Influenza viruses exchange genetic material, a process known as reassortment (Holmes, Ghedin, Miller, Taylor, Bao, St George, Grenfell, Salzberg, Fraser, Lipman et al. 2005; Lindstrom, Cox and Klimov, 2004; Schweiger, Bruns and Meixenberger, 2006). At least two of the major Influenza pandemics of the twentieth century, H2N2 in 1957 and H3N2 in 1968, resulted from reassortments between viruses from two different hosts, avian and human. Within human viruses, reassortments have been related to some of the failures of Influenza vaccine prediction, as in 2003. How often do these reassortments occur? Are all possible reassortments equally likely or are there preferred patterns? If segments were exchanged independently and at random when two viruses combine, we would expect the number of segments inherited from each parent to follow a binomial distribution, i.e. roughly four segments from each parental strain.
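The binomial expectation can be made concrete with a short calculation. The sketch below assumes that each of the eight segments is inherited independently and with equal probability from either parent; this is the null model being tested, not an observed result.

```python
from math import comb

N_SEGMENTS = 8

def prob_k_from_parent_a(k, n=N_SEGMENTS, p=0.5):
    """Probability that exactly k of n segments come from parent A,
    assuming each segment is drawn independently with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

dist = [prob_k_from_parent_a(k) for k in range(N_SEGMENTS + 1)]
for k, pk in enumerate(dist):
    print(f"{k} segments from parent A: {pk:.4f}")

# Progeny with k = 0 or k = 8 are copies of one parent, not reassortants.
p_reassortant = 1 - dist[0] - dist[-1]
print(f"fraction of progeny that are reassortants: {p_reassortant:.4f}")
```

Under this null model the four-and-four split is the single most likely outcome (70 of the 256 equally likely genotypes), and 254 of the 256 genotypes are reassortants. Observed departures from these proportions are what indicate preferred segment combinations.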
To answer these questions, M. Lubeck, P. Palese and J. Schulman analyzed 40 reassortant viruses derived from A/PR/8/34 (H1N1) and A/HK/8/68 (H3N2) in the laboratory (Lubeck, Palese and Schulman, 1979). They found strong correlations among segments 1, 2 and 3, between segments 1 and 5, and between segments 3 and 8. Are these results universal? Do they only apply to a particular pair of Influenza strains? Can they be found in vivo during local epidemics? Another issue is that the patterns of reassortment observed in vitro, in cell culture, are not subject to immunoselection or other forces that may act in vivo in human hosts. To understand the patterns of reassortment of viral populations one has to provide quantitative answers.
The most traditional way of detecting reassortments is by constructing phylogenetic trees for the whole genome, as well as for each viral segment, and looking for strains that have segments on different branches of their respective trees (Holmes, Ghedin, Miller, Taylor, Bao, St George, Grenfell, Salzberg, Fraser, Lipman et al. 2005; Lindstrom, Cox and Klimov, 2004; Lindstrom, Hiromoto, Nerome, Omoe, Sugita, Yamazaki, Takahashi and Nerome, 1998; Nelson, Simonsen, Viboud, Miller, Taylor, George, Griesemer, Ghedin, Sengamalay, Spiro et al. 2006; Schweiger, Bruns and Meixenberger, 2006). There are several limitations to this approach: the structure of a phylogenetic tree depends on the method used to construct it, and mutational biases can make accurate phylogenetic analysis very challenging. If our only goal is detecting likely reassortments, there is no need to go through the intermediate step of tree inference. For instance, we can compare genetic distances between pairs of viruses in different segments (Rabadan, Levine and Kraznitz). As viruses replicate over time their sequences change, and knowing the evolutionary rates in every segment one can estimate how likely it is that a particular set of distances arises by chance. For example, let us take two segments, segment 1 (coding for one of the polymerases, PB2) and segment 3, for human H3N2 Influenza strains in New York State (208 sequences from 2000-2003) (see Fig. 5). To reduce the effect of selection pressures we only consider third codon positions. We take every pair of viruses and we count the changes between the two viruses in segment 1 and in segment 3. If there are no reassortments the distances should fall along a straight line, and deviations from this line indicate a possible reassortment.

[Figure: plot against year, 1920 to 2000 (Rabadan, Levine and Robins, 2006).]
In Figure 5, we can see that although most of the points lie along the line with slope one (45 degrees), there are many points lying significantly off the diagonal. Most of the points that reside off the diagonal come from pairs containing a single strain, A/New York/11/2003(H3N2) (in Fig. 5 the pairs of sequences that contain A/New York/11/2003(H3N2) are marked in red). This is a clear indication that A/New York/11/2003(H3N2) is a reassortant that involved segment 1 or segment 3. We can construct statistical tests that measure how probable it is that a particular pair deviates from the diagonal, and using these tests we can systematically extract all the cases that can be identified as reassortants. With the list of reassortments, we can estimate how likely it is that the reassortment process is random (in this case binomial), what the correlations between different segments are, and what the reassortment rates are.
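The pairwise comparison described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the published method: sequences are assumed to be aligned and in frame, both segments are assumed to evolve at the same rate (so the expected line has slope one), and a fixed cutoff `threshold` stands in for a proper statistical test. All names and the toy sequences are invented.

```python
from itertools import combinations
from collections import Counter

def third_position_distance(a, b):
    """Count differences at third codon positions of two aligned coding sequences."""
    return sum(1 for i in range(2, min(len(a), len(b)), 3) if a[i] != b[i])

def off_diagonal_pairs(seg_x, seg_y, threshold):
    """Return pairs of strains whose distances in two segments disagree.

    seg_x and seg_y map strain name -> aligned coding sequence.
    A pair is flagged when |d_x - d_y| > threshold (a crude cutoff).
    """
    flagged = []
    for s, t in combinations(sorted(seg_x), 2):
        dx = third_position_distance(seg_x[s], seg_x[t])
        dy = third_position_distance(seg_y[s], seg_y[t])
        if abs(dx - dy) > threshold:
            flagged.append((s, t, dx, dy))
    return flagged

def suspect_strains(flagged):
    """Count how often each strain appears in flagged pairs, most frequent first."""
    counts = Counter()
    for s, t, _, _ in flagged:
        counts[s] += 1
        counts[t] += 1
    return counts.most_common()

# Toy data (invented): strain "R" carries a divergent copy of segment X only.
seg_x = {"A": "AAAAAAAAAAAA", "B": "AAAAAAAAAAAA",
         "C": "AAGAAAAAAAAA", "R": "AAGAAGAAGAAG"}
seg_y = {"A": "CCCCCCCCCCCC", "B": "CCCCCCCCCCCC",
         "C": "CCUCCCCCCCCC", "R": "CCCCCCCCCCCC"}

flagged = off_diagonal_pairs(seg_x, seg_y, threshold=2)
print(suspect_strains(flagged))
```

In the toy data the strain that recurs across the flagged pairs is "R", mirroring how a single strain accounts for most off-diagonal points in Figure 5. With real data the distances would need to be scaled by segment length and evolutionary rate before comparison.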
There are several possible explanations for the fact that the reassortment process is not random. The first possible explanation is that the interactions between the different proteins of the virus demand compensatory mutations. For instance, we know that the three polymerases form a complex of proteins and that they work together. A mutation of one amino acid in the interaction domain of one of these polymerases can be compensated by mutations in another polymerase. Another possible explanation comes from the fact that the interaction with the host cell requires host and tissue specificity. To infect a particular host or cell, the virus needs a particular combination of proteins that is well adapted for its growth in this cell. A third possible explanation comes from the process of packaging the eight different RNAs in the virion. The mechanism controlling how this could happen is not clear, although two different hypotheses have been put forward: random packaging and specific signals (Bancroft and Parslow, 2002; Fujii, Fujii, Noda, Muramoto, Watanabe, Takada, Goto, Horimoto and Kawaoka, 2005; Fujii, Goto, Watanabe, Yoshida and Kawaoka, 2003; Gog, Afonso Edos, Dalton, Leclercq, Tiley, Elton, von Kirchbach, Naffakh, Escriou and Digard, 2007; Liang, Hong and Parslow, 2005; Muramoto, Takada, Fujii, Noda, Iwatsuki-Horimoto, Watanabe, Horimoto, Kida and Kawaoka, 2006; Noda, Sagara, Yen, Takada, Kida, Cheng and ...; Odagiri and Tashiro, 1997; Zheng, Palese and Garcia-Sastre, 1996).

Figure 5. Mutations accumulate at similar rates in different segments, so most of the pairs are distributed along the diagonal. When reassortments occur this pattern is violated, and the exchange of segments produces pairs of viruses whose distances in different segments are not proportional to each other. Reassortments appear as points off the diagonal. We can then proceed to analyze the sequences that give rise to these points. In red are the pairs of sequences that contain A/New York/11/2003(H3N2) (Rabadan, Levine and Robins, 2006).

Conclusions, Open Problems and Future Directions
This review discusses high mutation rates and reassortments as the main mechanisms of evolution of Influenza. Are these the only two modes of evolution of this virus? Two cases of non-homologous recombination have been reported in avian Influenza viruses, one in 2002 in Chile and the other in 2004 in British Columbia, Canada (Pasick, Handel, Robinson, Copps, Ridd, Hills, Kehler, Cottam-Birt, Neufeld, Berhane et al. 2005; Suarez, Senne, Banks, Brown, Essen, Lee, Manvell, Mathieu-Benson, Moreno, Pedersen et al. 2004). In both cases, a low pathogenic avian Influenza virus (LPAI) mutated into a more virulent form (high pathogenic or HPAI). When these viruses were sequenced, the only difference found was an insertion of a few amino acids in the cleavage site of Hemagglutinin. The extra sequence was incorporated from other segments (MP and NP). Apart from other reported cases in the laboratory, it is clear that non-homologous recombination is not a very common phenomenon. More controversial is the possibility of homologous recombination (Chare, Gould and Holmes, 2003). Homologous recombination is an important mode of evolution in retroviruses (e.g. HIV); however, for Influenza it has not been found in the laboratory.
We have seen how the two main modes of evolution of Influenza present non-random patterns. Influenza viruses replicating in humans become more U rich in their genome. That allows us to understand a particular direction in the evolution of human Influenza, to understand the past and to predict the future of these viruses. Reassortments are not random processes, neither in vivo nor in vitro. The sequence space of different Influenza viruses is enormous and we are only touching the tip of the iceberg. Thanks to the worldwide sequencing effort and the amount of information that is publicly available, we can start answering some of the questions about this virus, its hosts and its evolution.